Fault-tolerant finite-element multigrid algorithms with hierarchically compressed asynchronous checkpointing
Authors
Abstract
We examine novel fault-tolerance schemes for data loss in multigrid solvers, which essentially combine ideas of checkpoint-restart with algorithm-based fault tolerance. To improve efficiency compared to conventional global checkpointing, we exploit the inherent data compression of the multigrid hierarchy and relax the synchronicity requirement through a local-failure local-recovery approach. We experimentally identify the root cause of convergence degradation in the presence of data loss using smoothness considerations. Our resulting schemes form a family of techniques that can be tailored to the expected error probability of (future) large-scale machines. A performance model gives further insight into the benefits and applicability of our techniques.
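The abstract does not spell out how the multigrid hierarchy compresses a checkpoint, so the following is only a minimal sketch of the general idea, not the authors' scheme. It assumes a 1D nodal grid with 2^k + 1 points, full-weighting restriction, and linear interpolation; the function names (`restrict`, `prolongate`, `compressed_checkpoint`, `recover`) are hypothetical. The checkpoint stores the solution restricted to a coarser level, and recovery interpolates it back to the fine grid.

```python
def restrict(fine):
    """Full-weighting restriction of a 1D nodal vector to the next coarser grid.

    Assumes len(fine) == 2^k + 1; boundary values are copied unchanged.
    """
    n_coarse = (len(fine) - 1) // 2 + 1
    coarse = [0.0] * n_coarse
    coarse[0] = fine[0]
    coarse[-1] = fine[-1]
    for i in range(1, n_coarse - 1):
        coarse[i] = 0.25 * fine[2 * i - 1] + 0.5 * fine[2 * i] + 0.25 * fine[2 * i + 1]
    return coarse


def prolongate(coarse):
    """Linear interpolation of a 1D nodal vector to the next finer grid."""
    n_fine = 2 * (len(coarse) - 1) + 1
    fine = [0.0] * n_fine
    for i, value in enumerate(coarse):
        fine[2 * i] = value  # coincident nodes are copied
    for i in range(len(coarse) - 1):
        fine[2 * i + 1] = 0.5 * (coarse[i] + coarse[i + 1])  # midpoints are averaged
    return fine


def compressed_checkpoint(u, levels):
    """Hypothetical helper: store the solution restricted `levels` times."""
    for _ in range(levels):
        u = restrict(u)
    return list(u)


def recover(checkpoint, levels):
    """Hypothetical helper: rebuild a fine-grid approximation from the checkpoint."""
    u = checkpoint
    for _ in range(levels):
        u = prolongate(u)
    return u


if __name__ == "__main__":
    u = [i / 16 for i in range(17)]        # linear field on 17 nodes
    ckpt = compressed_checkpoint(u, 2)     # stored on 5 nodes
    print(len(u), len(ckpt), recover(ckpt, 2) == u)
```

In this sketch each level roughly halves the stored data in 1D (a factor of about 2^d per level in d dimensions). Only the smooth part of the solution survives the round trip: a linear field is reproduced exactly, while oscillatory components are lost and would have to be rebuilt by further solver iterations.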
Similar resources
A Consensus-Based Fault-Tolerant Event Logger for High Performance Applications
High-performance computing (HPC) systems traditionally employ rollback-recovery techniques to allow fault-tolerant executions of parallel applications. Rollback-recovery based on message logging is an attractive strategy that avoids the drawbacks of coordinated checkpointing in systems with low mean time between failures (MTBF). Most message logging protocols rely on a centralized event logger t...
Modeling of Hierarchical Distributed Systems with Fault-Tolerance
Abstract: This paper addresses some fault-tolerant issues pertaining to hierarchically distributed systems. Since each of the levels in a hierarchical system could have various characteristics, different fault-tolerance schemes could be appropriate at different levels. In this paper, we use stochastic Petri nets (SPN's) to investigate various fault-tolerant schemes in this context. The basic ...
Asynchronous parallel solvers for linear systems arising in computational engineering
Modern trends in Computational Science and Engineering are moving towards the use of computer systems with ever increasing numbers of computational cores. A consequence of this is that over the next decade it will be necessary to develop and apply new numerical algorithms that are far more scalable than has historically been required. Ideally, such algorithms will be able to exploit many thousa...
A Scalable Asynchronous Replication-Based Strategy for Fault Tolerant MPI Applications
As computational clusters increase in size, their mean-time-to-failure reduces. Typically checkpointing is used to minimize the loss of computation. Most checkpointing techniques, however, require a central storage for storing checkpoints. This severely limits the scalability of checkpointing. We propose a scalable replication-based MPI checkpointing facility that is based on LAM/MPI. We extend...
Metapromela: A Toolkit for Simulation of Checkpointing Algorithms
Distributed checkpointing algorithms play an important role in the majority of the fault tolerant software components existent today. Unfortunately, there is a lack of comprehensive and uniform performance testing of those algorithms. Our research focuses on the provision of a toolkit, Metapromela, that helps with the implementation and testing of distributed checkpointing algorithms. This pape...
Journal title:
- Parallel Computing
Volume 49, Issue
Pages -
Publication date: 2015